1 Introduction

Among all cancer types, lung cancer is the most lethal among both males and females. It is responsible for 28% of cancer-related mortality. The life expectation for those who have lung cancer are very poor. It is usually diagnosed in an advanced phase, and the patients only have a 16% survival rate after 5 years of their diagnosis.

The Cancer Genome Atlas (TCGA) has comprehensively profiled this type of cancer in a patient cohort. Here, we analyze the expression profiles of those patients, using a pipeline based on the R/Bioconductor software package Rsubread.

This document is written in R markdown and should be processed using R and you need to install the packages knitr and markdown. Moreover, it using the official style for Bioconductor vignettes facilitated by the Bioconductor package BiocStyle. It is compiled from different files following the instructions included in the makefile.

2 Quality assessment

2.1 Data import

The assessment and analysis of lung squamous cell carcinoma starts by importing the raw table of counts.

library(SummarizedExperiment)

se.unpaired <- readRDS(file.path("rawCounts", "seLUSC.rds"))
se.unpaired
class: RangedSummarizedExperiment 
dim: 20115 553 
metadata(5): experimentData annotation cancerTypeCode
  cancerTypeDescription objectCreationDate
assays(1): counts
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(553): TCGA.18.3406.01A.01R.0980.07
  TCGA.18.3407.01A.01R.0980.07 ... TCGA.90.7767.11A.01R.2125.07
  TCGA.92.7340.11A.01R.2045.07
colData names(549): type bcr_patient_uuid ...
  lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total
table(se.unpaired$type)

normal  tumor 
    51    502 

Information can already be obtained. For example, that the file contains 20115 genes from a total of 553 samples. Of this samples, 502 are tumor samples and 51 are normal samples, which matches the information of the dataset from The Cancer Genome Atlas (TCGA) program.

Next, the column (phenotypic) data can be explored, which in this case corresponds to clinical variables and their corresponding metadata. For example, the column “type” consists of tumor and normal labels. Other information such as gender, tumor stage, and smoking status is also shown.

In any case, all the possible labels can be accessed with the last line of code of the code block below, which returns two columns of information about the clinical variables. One called labelDescription contains a succint description of the variable, often not more self-explanatory than the variable name itself, and the other called CDEID corresponds to the so-called Common Data Element (CDE) identifier. This identifier can be used in https://cdebrowser.nci.nih.gov to search for further information about the associated clinical variable using the Advanced search form and the Public ID attribute search.

dim(colData(se.unpaired))
[1] 553 549
colData(se.unpaired)[1:5, 1:5]
DataFrame with 5 rows and 5 columns
                                 type                     bcr_patient_uuid
                             <factor>                             <factor>
TCGA.18.3406.01A.01R.0980.07    tumor 95b83006-02c9-4c4d-bf84-a45115f7d86d
TCGA.18.3407.01A.01R.0980.07    tumor 4e1ad82e-23c8-44bb-b74e-a3d0b1126b96
TCGA.18.3408.01A.01R.0980.07    tumor d4bc755a-2585-4529-ae36-7e1d88bdecfe
TCGA.18.3409.01A.01R.0980.07    tumor b09e872a-e837-49ec-8a27-84dcdcabf347
TCGA.18.3410.01A.01R.0980.07    tumor 99599b60-4f5c-456b-8755-371b1aa7074e
                             bcr_patient_barcode form_completion_date
                                        <factor>             <factor>
TCGA.18.3406.01A.01R.0980.07        TCGA-18-3406             2011-3-9
TCGA.18.3407.01A.01R.0980.07        TCGA-18-3407             2011-3-9
TCGA.18.3408.01A.01R.0980.07        TCGA-18-3408             2011-3-9
TCGA.18.3409.01A.01R.0980.07        TCGA-18-3409            2011-3-17
TCGA.18.3410.01A.01R.0980.07        TCGA-18-3410             2011-4-4
                             prospective_collection
                                           <factor>
TCGA.18.3406.01A.01R.0980.07                     NO
TCGA.18.3407.01A.01R.0980.07                     NO
TCGA.18.3408.01A.01R.0980.07                     NO
TCGA.18.3409.01A.01R.0980.07                     NO
TCGA.18.3410.01A.01R.0980.07                     NO
mcols(colData(se.unpaired), use.names=TRUE)
DataFrame with 549 rows and 2 columns
                                                         labelDescription
                                                              <character>
type                                           sample type (tumor/normal)
bcr_patient_uuid                                         bcr patient uuid
bcr_patient_barcode                                   bcr patient barcode
form_completion_date                                 form completion date
prospective_collection            tissue prospective collection indicator
...                                                                   ...
lymph_nodes_pelvic_pos_total                               total pelv lnp
lymph_nodes_aortic_examined_count                           total aor lnr
lymph_nodes_aortic_pos_by_he                          aln pos light micro
lymph_nodes_aortic_pos_by_ihc                                 aln pos ihc
lymph_nodes_aortic_pos_total                                total aor-lnp
                                        CDEID
                                  <character>
type                                       NA
bcr_patient_uuid                           NA
bcr_patient_barcode                   2673794
form_completion_date                       NA
prospective_collection                3088492
...                                       ...
lymph_nodes_pelvic_pos_total          3151828
lymph_nodes_aortic_examined_count     3104460
lymph_nodes_aortic_pos_by_he          3151832
lymph_nodes_aortic_pos_by_ihc         3151831
lymph_nodes_aortic_pos_total          3151827

The row (feature) data provides information on genes. For the first line of command, the length of each gene and the GC content can be seen. The second line of command provides further information, such as the ranges within individual chromosome.

rowData(se.unpaired)
DataFrame with 20115 rows and 3 columns
               symbol     txlen              txgc
          <character> <integer>         <numeric>
1                A1BG      3322 0.564419024683925
2                 A2M      4844 0.488232865400495
9                NAT1      2280 0.394298245614035
10               NAT2      1322 0.389561270801815
12           SERPINA3      3067 0.524942940984676
...               ...       ...               ...
100996331       POTEB      1706 0.430832356389215
101340251    SNORD124       104 0.490384615384615
101340252   SNORD121B        81 0.407407407407407
102724473      GAGE10       538 0.505576208178439
103091865   BRWD1-IT2      1028 0.592412451361868
rowRanges(se.unpaired)
GRanges object with 20115 ranges and 3 metadata columns:
            seqnames            ranges strand |      symbol     txlen
               <Rle>         <IRanges>  <Rle> | <character> <integer>
          1    chr19 58345178-58362751      - |        A1BG      3322
          2    chr12   9067664-9116229      - |         A2M      4844
          9     chr8 18170477-18223689      + |        NAT1      2280
         10     chr8 18391245-18401218      + |        NAT2      1322
         12    chr14 94592058-94624646      + |    SERPINA3      3067
        ...      ...               ...    ... .         ...       ...
  100996331    chr15 20835372-21877298      - |       POTEB      1706
  101340251    chr17 40027542-40027645      - |    SNORD124       104
  101340252     chr9 33934296-33934376      - |   SNORD121B        81
  102724473     chrX 49303669-49319844      + |      GAGE10       538
  103091865    chr21 39313935-39314962      + |   BRWD1-IT2      1028
                         txgc
                    <numeric>
          1 0.564419024683925
          2 0.488232865400495
          9 0.394298245614035
         10 0.389561270801815
         12 0.524942940984676
        ...               ...
  100996331 0.430832356389215
  101340251 0.490384615384615
  101340252 0.407407407407407
  102724473 0.505576208178439
  103091865 0.592412451361868
  -------
  seqinfo: 455 sequences (1 circular) from hg38 genome

In this analysis, a comparison between tumor and normal cells was done. For this, the dataset is subsetted to obtain only those samples which come from the same patient but have different type, setting a paired design. This strategy also has the benefit of reducing the possibility of false positives. In order to get them, the following code is executed.

# get the number of occurences of each patient
occur <- data.frame(table(substr(colnames(se.unpaired), 9, 12)))

# get those that occur twice
paired_df <- occur[occur$Freq > 1,]
paired <- as.vector(paired_df$Var1)
paired <- paired[2:length(paired)]

mask <- is.element(substr(colnames(se.unpaired), 9, 12), paired)

se <- se.unpaired[ ,mask]

saveRDS(se, file.path("results", "se.paired.rds"))

Since this data is unprocessed, a first step of quality assessment and normalization must be performed. To do this, the edgeR R/Bioconductor package is loaded. This requires the creation of a DGEList object, as the package does not work directly withSummarizedExperiment` objects. In any case, updates to one of the objects will be reflected in the other.

library(edgeR)

dge <- DGEList(counts=assays(se)$counts, genes=mcols(se))
saveRDS(dge, file.path("results", "dge.paired.rds"))

One way of normalizing is to use the counts per million (CPM) values. This value is calculated by dividing the number of counts of each sample by 1 million. Instead of using this directly, the \(log_2\) CPM values of expression are calculated as they have nicer distributional properties than raw counts or non-logged CPM units. This information is stored as an additional assay element to ease its manipulation.

assays(se)$logCPM <- cpm(dge, log=TRUE, prior.count=0.5)
assays(se)$logCPM[1:5, 1:5]
   TCGA.22.4593.01A.21R.1820.07 TCGA.22.4609.01A.21R.2125.07
1                      2.433139                     1.760377
2                      9.013463                    10.810625
9                     -6.817220                    -6.817220
10                    -6.817220                    -6.817220
12                     5.605909                     5.647356
   TCGA.22.5471.01A.01R.1635.07 TCGA.22.5472.01A.01R.1635.07
1                      2.122666                     0.775287
2                      7.713750                    11.100735
9                     -6.817220                    -6.817220
10                    -6.817220                    -6.817220
12                     4.823947                     5.856939
   TCGA.22.5478.01A.01R.1635.07
1                      3.713844
2                      9.448174
9                     -6.817220
10                    -6.817220
12                     4.920820

2.2 Sequencing depth

First, the sequencing depth in terms of total number of sequence read counts mapped to the genome per sample can be examined. Figure 1 below shows the sequencing depth per sample, also known as library size, in increasing order.

Library sizes in increasing order.

Figure 1: Library sizes in increasing order

This figure reveals no big changes in sequencing depth between samples. In addition, a uniform distribution of tumor and normal samples is seen across the figure. There is a cluster of normal samples on the right side with a higher CPM, which is not considered as problematic for further analyses. The identifiers of those samples with lower sequencing depth can be obtained using the commands below.

sampledepth <- round(dge$sample$lib.size / 1e6, digits=1)
names(sampledepth) <- substr(colnames(se), 6, 12)
sort(sampledepth)
43.6773 56.7823 77.7335 56.8309 56.8083 58.8386 77.7335 43.6773 56.7582 
   28.8    29.0    31.0    32.0    32.9    33.4    34.9    35.0    35.9 
56.8201 56.8201 56.8623 34.8454 77.7142 56.7579 56.7730 56.8083 56.8082 
   36.5    37.0    37.0    37.6    38.7    39.7    40.2    40.3    40.4 
56.8082 51.4081 34.8454 22.5481 56.8309 77.7337 33.4587 56.7823 56.7580 
   40.7    41.9    42.1    42.3    43.6    44.0    45.6    45.7    45.8 
56.7579 85.7710 58.8386 34.7107 56.7731 22.5471 77.7142 56.7580 56.7731 
   45.9    46.5    46.6    47.3    47.6    47.8    48.4    49.7    50.0 
85.7710 43.3394 39.5040 22.5472 60.2709 51.4079 90.6837 51.4080 77.7338 
   50.1    50.2    51.0    51.1    51.2    51.5    51.6    51.9    52.0 
56.7222 77.7338 77.7138 56.7582 22.5471 43.5670 22.4609 22.5483 92.7340 
   52.4    52.4    52.8    52.9    53.6    53.6    54.2    54.4    54.6 
77.8008 34.7107 90.7767 43.7657 22.4609 43.7657 77.7138 33.6737 43.7658 
   54.9    55.1    55.7    55.8    56.3    56.7    57.0    57.2    57.4 
77.7337 77.8007 43.5670 56.7222 56.8623 33.4587 43.7658 43.6143 22.5481 
   57.6    57.6    58.0    58.0    59.1    59.6    59.9    60.0    60.3 
22.5491 43.6647 92.7340 39.5040 22.5489 22.5482 77.8008 22.5489 22.5491 
   61.0    61.3    61.3    61.5    63.5    64.5    65.2    65.5    66.0 
77.8007 22.5478 90.6837 90.7767 56.7730 33.6737 60.2709 22.5478 43.6143 
   66.4    68.0    68.2    68.4    68.8    69.3    69.8    71.0    75.9 
22.5472 22.5482 22.4593 43.6771 22.5483 43.6647 22.4593 43.6771 51.4081 
   76.2    77.0    77.9    79.4    80.3    81.5    85.0    88.8   103.7 
51.4080 51.4079 43.3394 
  119.5   122.6   124.1 

2.3 Distribution of expression levels among samples

Next step is to look at the distribution of expression values per sample in terms of logarithmic CPM units. Tumor and normal samples are displayed separately, and are shown in Figure 2. Each colored line in the graphs below represent a sample. Both graphs have two modes: the first represents genes not expressed in that sample, while the second represents the genes that are expressed.

Non-parametric density distribution of expression profiles per sample.

Figure 2: Non-parametric density distribution of expression profiles per sample

In both graphs, some genes deviate a little from the rest, so they are noted as potential outliers.

2.4 Distribution of expression levels among genes

To look at the distribution of expression levels across genes, the average expression per gene through all the samples is calculated. Figure 3 shows the distribution of those values.

Distribution of average expression level per gene.

Figure 3: Distribution of average expression level per gene

2.5 Filtering of lowly-expressed genes

RNA sequence expression profiles that come from lowly-expressed genes can lead to artifacts in downstream differential expression analyses. Thus, it is common to set a threshold and filter out genes that fall below this value.

In the light of the plot above (Figure 3), a cutoff of 1 log CPM unit is considered as the minimum value of expression to select genes being expressed across samples. Using this cutoff we proceed to filter out lowly-expressed genes.

mask <- avgexp > 1
dim(se)
[1] 20115   102
se.filt <- se[mask, ]
dim(se.filt)
[1] 11872   102
dge.filt <- dge[mask, ]
dim(dge.filt)
[1] 11872   102

A this point, there are a total of 11872 genes and 102 samples. The unnormalized versions of the filtered expression data are stored.

saveRDS(se.filt, file.path("results", "se.paired.gfilt.unnorm.rds"))
saveRDS(dge.filt, file.path("results", "dge.paired.gfilt.unnorm.rds"))

2.6 Normalization

Next, the normalization factors on the filtered expression data set are calculated using the calcNormFactors function.

dge.filt <- calcNormFactors(dge.filt)

Then, the raw log2 CPM units in the corresponding assay element of the SummarizedExperiment object are replaced by the normalized ones.

assays(se.filt)$logCPM <- cpm(dge.filt, log=TRUE, normalized.lib.sizes=TRUE, prior.count=0.25)

Finally, the normalized versions of the filtered expression data are stored.

saveRDS(se.filt, file.path("results", "se.paired.gfilt.norm.rds"))
saveRDS(dge.filt, file.path("results", "dge.paired.gfilt.norm.rds"))

2.7 MA-plots

MA-plots of the normalized expression profiles allow us to visualize how one sample compares to the average of the rest of the samples, therefore assessing its quality. Tumor samples are in Figure 4.

MA-plots of the tumor samples.

Figure 4: MA-plots of the tumor samples

The MA-plots of the tumor samples provide the following insight:

  • 3 samples with high differences from the mean, with a threshold ( M > 2 or M < -2). Their MA plots have the regression line in red.
  • 14 samples with moderate differences from the mean, which we define as ( M <= 2 and M >= 1 or M >=-2 and M <= -1). Their regression line is brown.
  • 34 samples with discrete differences from the mean as they are (M < 1 or M > -1). Their regression line is green.

The appearance of samples that differ from the mean can be problematic. It is worth identifying the name of these samples to keep track of them in subsequent quality assessments. If they probe to be more problematic in downstream analysis, those samples should be removed together with their normal pairs.

big_c
[1] "TCGA.56.7823" "TCGA.56.8623" "TCGA.77.7138"

The normal samples are in Figure 5.

MA-plots of the normal samples.

Figure 5: MA-plots of the normal samples

In this case, the observations are:

  • 0 samples with high differences from the mean, with a threshold ( M > 2 or M < -2). Their MA plots have the regression line in red.
  • 0 samples with moderate differences from the mean, which we define as ( M <= 2 and M >= 1 or M >=-2 and M <= -1). Their regression line is brown.
  • 51 samples with discrete differences from the mean as they are (M < 1 or M > -1). Their regression line is green.

In conclusion, either important expression-level dependent biases are not observed among the normal samples.

2.8 Batch identification

Batch effects are sub-groups of measurements that have a qualitatively different behavior across conditions and are unrelated to the variables in the study. It is important to determine if batch effects are present to know if any confounding variables are affecting the data.

Potential surrogate of batch effect indicators within normal and tumor samples can be obtained. Given that each sample names corresponds to a TCGA barcode (see https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode), following the strategy described in http://bioinformatics.mdanderson.org/main/TCGABatchEffects:Overview, different elements of the TCGA barcode can be derived from it, and examine their distribution across samples.

tss <- substr(colnames(se.filt), 6, 7)
table(data.frame(TYPE=se.filt$type, TSS=tss))
        TSS
TYPE     22 33 34 39 43 51 56 58 60 77 85 90 92
  normal 10  2  2  1  8  3 12  1  1  7  1  2  1
  tumor  10  2  2  1  8  3 12  1  1  7  1  2  1
center <- substr(colnames(se.filt), 27, 28)
table(data.frame(TYPE=se.filt$type, CENTER=center))
        CENTER
TYPE     07
  normal 51
  tumor  51
plate <- substr(colnames(se.filt), 22, 25)
table(data.frame(TYPE=se.filt$type, PLATE=plate))
        PLATE
TYPE     0980 1100 1635 1758 1820 1858 1949 2045 2125 2187 2247 2296 2326
  normal    0    0    5    4    7    1    4   10   10    2    4    2    1
  tumor     1    3    6    0    7    0    4   10   10    2    4    2    1
        PLATE
TYPE     A28V
  normal    1
  tumor     1
portionanalyte <- substr(colnames(se.filt), 18, 20)
table(data.frame(TYPE=se.filt$type, PORTIONANALYTE=portionanalyte))
        PORTIONANALYTE
TYPE     01R 11R 21R 31R 41R
  normal  48   3   0   0   0
  tumor   11  28   8   2   2
samplevial <- substr(colnames(se.filt), 14, 16)
table(data.frame(TYPE=se.filt$type, SAMPLEVIAL=samplevial))
        SAMPLEVIAL
TYPE     01A 01B 11A
  normal   0   0  51
  tumor   50   1   0

From this information the following observations can be done:

  • All samples were sequenced at the same center
  • Tumor and normal samples were collected in the same vial except one:
colnames(se.filt)[samplevial == "01B"]
[1] "TCGA.56.7823.01B.11R.2247.07"
  • Samples were collected evenly across different tissue source sites (TSS).
  • Samples were mostly well distributed in different plates.
  • Samples of normal cells concentrate into two different portion of analyte, while the tumor samples have five different portions.

With those results, the portion of analyte is used as a surrogate of batch effect indicator. Considering the outcome of interest as the molecular changes between sample types, tumor vs. normal, the cross-classification of this outcome with portion of analyte are examined.

table(data.frame(TYPE=se.filt$type, PORTIONANALYTE=portionanalyte))
        PORTIONANALYTE
TYPE     01R 11R 21R 31R 41R
  normal  48   3   0   0   0
  tumor   11  28   8   2   2

It is shown how the majority of normal samples (48) come from the “01R”" analyte and 3 come from the “11R” analyte. However, the tumor samples span across five different analytes. This could potentially lead to a batch effect.

In order to examine how the samples group together, two approaches are used here: hierarchical clustering and multidimensional scaling, annotating the outcome of interest and the surrogate of batch indicator, or portion of analyte in this case.

The approaches are performed under a subset of the samples, as it is desirable to have a matrix without zeros. Samples belonging to portions 21R, 31R and 41R are removed, and as the experiment is paired, also their pairs.

por1 <- as.vector(substr(colnames(se.filt)[portionanalyte == "41R"], 9, 12))
por2 <- as.vector(substr(colnames(se.filt)[portionanalyte == "31R"], 9, 12))
por3 <- as.vector(substr(colnames(se.filt)[portionanalyte == "21R"], 9, 12))
allpor <- c(por1, por2, por3)

table(is.element(substr(colnames(se.filt), 9, 12), allpor))

FALSE  TRUE 
   78    24 
se.batch <- se.filt[ ,!is.element(substr(colnames(se.filt), 9, 12), allpor)]
dge.batch <- DGEList(counts=assays(se.batch)$counts, genes=mcols(se.batch))

new.portionanalyte <- substr(colnames(se.batch), 18, 20)
table(data.frame(TYPE=se.batch$type, PORTIONANALYTE=new.portionanalyte))
        PORTIONANALYTE
TYPE     01R 11R
  normal  36   3
  tumor   11  28
dim(dge.batch)
[1] 11872    78

The code above removed 24 samples, which correspond to the 12 in the indicated portions and their pairs.

The log CPM values are calculated again with a higher prior count to moderate extreme fold-changes produced by low counts. The resulting dendrogram is shown in Figure 6.

logCPM <- cpm(dge.batch, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(new.portionanalyte))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(se.batch)
outcome <- paste(substr(colnames(se.batch), 9, 12), as.character(se.batch$type), sep="-")
names(outcome) <- colnames(se.batch)
sampleDendrogram <- dendrapply(sampleDendrogram,
                               function(x, batch, labels) {
                                 if (is.leaf(x)) {
                                   attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
                                   attr(x, "label") <- as.vector(labels[attr(x, "label")])
                                 }
                                 x
                               }, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))), fill=sort(unique(batch)))
Hierarchical clustering of the samples.

Figure 6: Hierarchical clustering of the samples

The dendogram shows how samples cluster primarily by sample type, tumor or normal. The different portions are present in both clusters, indicating that they do not induce batch effect. The plot also shows how one of the tumor samples, 8623-tumor, is present in the normal samples cluster. This sample also had a strong deviation in the MA plot.

In Figure 7 the corresponding MDS plot is shown. It can be seen that the first source of variation separates tumor from normal samples, and there is not any sample that is separated from any cluster.

plotMDS(dge.batch, labels=outcome, col=batch)
legend("bottomleft", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))),
       fill=sort(unique(batch)), inset=0.05)
Multidimensional scaling plot of the samples.

Figure 7: Multidimensional scaling plot of the samples

Finally, under the consideration that there is no batch effect, the 24 samples that were removed in order to make the dendogram and the MDS plot will remain for further analysis. Then, the dendogram is generated again in Figure 8 in order to see if any other sample appears in its opposite cluster, apart from 8623-tumor.

logCPM <- cpm(dge.filt, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(portionanalyte))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(se.filt)
outcome <- paste(substr(colnames(se.filt), 9, 12), as.character(se.filt$type), sep="-")
names(outcome) <- colnames(se.filt)
sampleDendrogram <- dendrapply(sampleDendrogram,
                               function(x, batch, labels) {
                                 if (is.leaf(x)) {
                                   attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
                                   attr(x, "label") <- as.vector(labels[attr(x, "label")])
                                 }
                                 x
                               }, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))), fill=sort(unique(batch)))
Hierarchical clustering of the samples.

Figure 8: Hierarchical clustering of the samples

Again, only the 8623-tumor sample does not cluster with the rest of the tumor samples. This sample will be removed from the data before further analysis. For this, new files will be saved without those samples.

se.gfilt.norm <- se.filt
dge.gfilt.norm <- dge.filt

mask <- !is.element(substr(colnames(se.filt), 9, 12), as.character(8623))

se.filt <- se.gfilt.norm[ ,mask]
dge.filt <- DGEList(counts=assays(se.filt)$counts, genes=mcols(se.filt))

dim(se.filt)
[1] 11872   100
dim(dge.filt)
[1] 11872   100
saveRDS(se.filt, file.path("results", "se.paired.gsfilt.norm.rds"))
saveRDS(dge.filt, file.path("results", "dge.paired.gsfilt.norm.rds"))

This leaves a number of 11872 and 100 samples, of which 50 are tumoral and 50 are normal. This leaves a number of 11872 genes and 100 samples, of which 50 are tumoral and 50 normal.

2.9 Session information

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] geneplotter_1.60.0          annotate_1.60.1            
 [3] XML_3.98-1.20               AnnotationDbi_1.44.0       
 [5] lattice_0.20-38             edgeR_3.24.3               
 [7] limma_3.38.3                SummarizedExperiment_1.12.0
 [9] DelayedArray_0.8.0          BiocParallel_1.16.6        
[11] matrixStats_0.54.0          Biobase_2.42.0             
[13] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
[15] IRanges_2.16.0              S4Vectors_0.20.1           
[17] BiocGenerics_0.28.0         knitr_1.23                 
[19] BiocStyle_2.10.0           

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1             RColorBrewer_1.1-2     compiler_3.5.3        
 [4] BiocManager_1.30.4     highr_0.8              XVector_0.22.0        
 [7] bitops_1.0-6           tools_3.5.3            zlibbioc_1.28.0       
[10] bit_1.1-14             digest_0.6.19          memoise_1.1.0         
[13] RSQLite_2.1.1          evaluate_0.14          Matrix_1.2-17         
[16] DBI_1.0.0              yaml_2.2.0             xfun_0.7              
[19] GenomeInfoDbData_1.2.0 stringr_1.4.0          bit64_0.9-7           
[22] locfit_1.5-9.1         grid_3.5.3             rmarkdown_1.13        
[25] bookdown_0.11          blob_1.1.1             magrittr_1.5          
[28] codetools_0.2-16       htmltools_0.3.6        xtable_1.8-4          
[31] KernSmooth_2.23-15     stringi_1.4.3          RCurl_1.95-4.12       

3 Differential expression analysis

Differential expression analysis consists of identifying changes in gene expression. For RNAseq experiments, this means comparing relative concentrations of RNA molecules. In R, this can be done by using the limma and sva R/Bioconductor packages.

3.1 Analysis

limma allows for linear model analysis of data for DE. This requires building two model matrices: one including the outcome of interest and adjustment variables, and a matrix of the null model that only includes the adjustment variables. The outcome of interest is the type of sample (tumor vs normal), and as both types come from the same patient the analysis is paired, so a unique identifier for each patient is included.

Even if no known sources of variation were found, surrogate variables representing unknown sources of variation can be added to the model. Those can be estimated with the sva package, and later be appended to the model. Finally, voom is used to fit the model and the moderated t-statistics are calculated. This pipeline is chosen over limma-trend because the library size distribution is not uniform across samples.

library(sva)

# Obtain the patient ID, as the analysis is paired
patientid <- substr(colnames(se.filt), 9, 12)
se.filt$type <- relevel(se.filt$type, ref = "normal")

# Build the model (mod) and null (mod0) matrices 
mod <- model.matrix(~type + patientid, data = colData(se.filt))
mod0 <- model.matrix(~patientid, data = colData(se.filt))

# Estimate surrogate variables
sv <- sva(assays(se.filt)$logCPM, mod = mod, mod0 = mod0)
Number of significant surrogate variables is:  12 
Iteration (out of 5 ):1  2  3  4  5  
# Add surrogates to the design matrix
len <- length(colnames(mod))
mod <- cbind(mod, sv$sv)
colnames(mod) <- c(colnames(mod)[1:len], paste0("SV", 1:sv$n))
v <- voom(dge.filt, mod, plot=TRUE)
Mean-variance plot for DE analysis.

Figure 9: Mean-variance plot for DE analysis

fit <- lmFit(v, mod) # Fit the model with the voom weights
fit <- eBayes(fit) # Calculate moderated t-statistics

sva was able to estimate 12 surrogate variables. Regarding the mean-variance plot shown in Figure 9, which was obtained with voom, it shows the variance in the y axis and the mean of expression in the x axis for each gene, represented as a black dot. Such plots contain a red curve that is fitted to the dots and show trends in the data. Normally increases in expression levels mean a decrease in the variance, until a plateau is reached for highly expressed genes. This trend is observed in the plot obtained for this data. Moreover, the plot is also a diagnosis of filtering steps performed upstream. The fact that there is not a drop in variance for the lower expression values indicates that the filtering threshold was good.

As multiple tests are being performed, the probability of type I errors (rejecting the null hypotheses when it is true) increases. There are multiple statistical strategies consisting of applying corrections to the p-values to decrease the likelihood of these errors, such as the false discovery rate (FDR). This technique consists of defining an acceptable rate of type I errors (false discoveries). Here, a FDR of 0.01 is set, which means that 1% of the rejections of the null hypothesis will be incorrect. For this data set, which has 11872 observations, up to 119 genes can be false positives (FP).

FDRcutoff <- 0.01
res <- decideTests(fit, p.value = FDRcutoff)
summary(res)
       (Intercept) typetumor patientid3394 patientid4079 patientid4080
Down             5      4095             0            18             0
NotSig         784      2735         11870         11851         11871
Up           11083      5042             2             3             1
       patientid4081 patientid4587 patientid4593 patientid4609 patientid5040
Down               1             4             1             0             0
NotSig         11870         11867         11869         11871         11871
Up                 1             1             2             1             1
       patientid5471 patientid5472 patientid5478 patientid5481 patientid5482
Down               0             0             0             5             1
NotSig         11871         11870         11871         11866         11870
Up                 1             2             1             1             1
       patientid5483 patientid5489 patientid5491 patientid5670 patientid6143
Down               0             1             0             0             1
NotSig         11871         11869         11870         11871         11869
Up                 1             2             2             1             2
       patientid6647 patientid6737 patientid6771 patientid6773 patientid6837
Down               4             0             1             1             0
NotSig         11867         11868         11869         11869         11871
Up                 1             4             2             2             1
       patientid7107 patientid7138 patientid7142 patientid7222 patientid7335
Down               1             2             4             0             4
NotSig         11870         11870         11865         11871         11866
Up                 1             0             3             1             2
       patientid7337 patientid7338 patientid7340 patientid7579 patientid7580
Down               0             1             5             0             3
NotSig         11870         11870         11860         11871         11867
Up                 2             1             7             1             2
       patientid7582 patientid7657 patientid7658 patientid7710 patientid7730
Down               2             4             4            17             3
NotSig         11867         11866         11868         11851         11867
Up                 3             2             0             4             2
       patientid7731 patientid7767 patientid7823 patientid8007 patientid8008
Down               5             0             5             4             8
NotSig         11864         11869         11866         11867         11863
Up                 3             3             1             1             1
       patientid8082 patientid8083 patientid8201 patientid8309 patientid8386
Down               6             0             0             0             0
NotSig         11865         11870         11871         11872         11871
Up                 1             2             1             0             1
       patientid8454   SV1   SV2   SV3   SV4   SV5   SV6   SV7   SV8   SV9
Down               8  1442  2106   783   778    71     4    36    24    82
NotSig         11863  8792  9315 10848 10350 11628 11618 11791 11825 11740
Up                 1  1638   451   241   744   173   250    45    23    50
        SV10  SV11  SV12
Down      60    16     1
NotSig 11688 11822 11865
Up       124    34     6

The summary table displays the upregulated, not significant, and downregulated genes for each variable in the model.

Gene metadata such as the gene symbol and the chromosome can be added for an easier interpretation of the obtained statistics to the results table, which is shown below this lines.

# Get gene info...
genesmd <- data.frame(chr = as.character(seqnames(rowRanges(se.filt))), symbol = rowData(se.filt)[, 1], stringsAsFactors = FALSE)

# and add it to the results
fit$genes <- genesmd
tt <- topTable(fit, coef = 2, n = Inf)
head(tt, n = 20)
            chr  symbol    logFC  AveExpr        t      P.Value    adj.P.Val
171482     chrX   FAM9A 6.856160 1.477480 39.26673 1.888897e-34 2.242498e-30
51569     chr13    UFM1 5.871020 3.035598 35.50838 1.092477e-32 6.484941e-29
23217     chr19    ZFR2 5.161345 6.564764 33.75968 8.297272e-32 3.283507e-28
2261       chr4   FGFR3 4.340034 4.461243 32.83373 2.526705e-31 7.499260e-28
27316      chrX    RBMX 4.510505 3.627725 31.49972 1.324366e-30 2.913015e-27
6406      chr20   SEMG1 2.593916 4.942304 31.27703 1.757285e-30 2.913015e-27
284352    chr19 PPP1R37 5.460151 1.390301 31.37636 1.548654e-30 2.913015e-27
2079      chr14     ERH 4.296390 1.320862 31.14923 2.068708e-30 2.913015e-27
54536     chr10   EXOC6 4.827975 2.127249 31.09821 2.208316e-30 2.913015e-27
100616453 chr11 MIR4687 3.547233 5.839996 30.63877 3.994001e-30 4.343544e-27
9263       chr7  STK17A 5.560139 7.124666 30.63291 4.024510e-30 4.343544e-27
11252     chr22 PACSIN2 4.993374 1.672498 30.16299 7.441565e-30 7.362188e-27
81556     chr15    VWA9 4.719041 2.080915 30.02847 8.887556e-30 8.116389e-27
222546     chr6    RFX6 4.807681 2.002138 29.93165 1.010385e-29 8.568064e-27
51804     chr14    SIX4 3.237016 4.291469 29.80521 1.195294e-29 9.460352e-27
5782       chr7  PTPN12 3.302445 8.759576 29.68308 1.406851e-29 1.043884e-26
117156     chr5 SCGB3A2 4.533204 6.802293 29.56743 1.642521e-29 1.147059e-26
158062     chr9    LCN6 7.986699 6.999330 29.26947 2.454208e-29 1.618687e-26
29889      chr1    GNL2 5.919431 1.866216 29.20485 2.678874e-29 1.673874e-26
692205     chr2 SNORD89 4.627202 4.105932 29.12152 2.999938e-29 1.780763e-26
                 B
171482    68.07999
51569     64.28491
23217     62.53609
2261      61.37201
27316     59.68472
6406      59.49557
284352    59.42536
2079      59.09763
54536     59.08835
100616453 58.67169
9263      58.66579
11252     57.89290
81556     57.73250
222546    57.60020
51804     57.57738
5782      57.38661
117156    57.25347
158062    56.85789
29889     56.65382
692205    56.62971

The results table includes relevant statistics such as raw and adjusted p-values. In the case of the later ones, it can be seen some of their values are pretty low, which means that the result is significant. The number of differentially expressed genes under the 0.01 FDR cuttoff can be checked by applying a filter to this table.

signDEgenes <- rownames(tt)[tt$adj.P.Val < FDRcutoff]
length(signDEgenes)
[1] 9137

The differential expression analysis results in 9137 significantly differentially expressed genes. Those can still be filtered by a log fold change cutoff of 2 and -2, with the following chromosome distribution.

tt.sign <- tt[tt$adj.P.Val < FDRcutoff,]
DEgenes <- rownames(tt.sign)[abs(tt.sign$logFC) > 2]

length(DEgenes)
[1] 1451
saveRDS(DEgenes, file.path("results", "DEgenes.rds"))

So we are finally calling differentially expressed to 1451 genes. As we added the chromosome information to the table of results, the chromosome distribution of these genes can be obtained.

#sort(table(tt$chr[tt$adj.P.Val < FDRcutoff]), decreasing = TRUE)
sort(table(tt$chr[abs(tt$logFC) > 2]), decreasing = TRUE)

               chr1               chr19               chr11 
                155                  99                  94 
               chr2               chr17               chr12 
                 90                  87                  83 
               chr9               chr16                chr5 
                 71                  68                  66 
               chr4                chr3               chr10 
                 64                  62                  61 
               chrX                chr6                chr7 
                 61                  60                  60 
              chr20               chr15               chr14 
                 47                  44                  41 
               chr8               chr22               chr13 
                 41                  30                  27 
              chr21               chr18 chr1_KI270766v1_alt 
                 23                  16                   1 
#plot(sort(table(tt$chr[tt$adj.P.Val < FDRcutoff & !grepl(".*[A|v][1]*$", tt$chr, perl=TRUE)]), decreasing = TRUE), main = "DE genes chromosome distribution", ylab = "Number of DE genes", xlab = "Chromosome", cex.axis = 0.7)

ttc <- tt.sign
ttc$chr <- sub( "^chr([X|Y|0-9]+.*)", "\\1", ttc$chr, perl = T)
ttc$chr <- sub( "(.*)_.*_.*$", "\\1A", ttc$chr, perl = T)

plot(sort(table(ttc$chr[abs(ttc$logFC) > 2 & !grepl(".*[A|v][1]*$", ttc$chr, perl=TRUE)]), decreasing = TRUE), main = "DE genes chromosome distribution", ylab = "Number of DE genes", xlab = "Chromosome", cex.axis = 0.7)
Chromosome distribution of the differentially expressed genes

Figure 10: Chromosome distribution of the differentially expressed genes

The distribution is shown in Figure 10, and it does not correspond to a distribution based on the chromosome size, as some chromosomes are located higher in the ranking than what would be expected by its size. The chromosome Y does not have any differentially expressed gene.

3.2 Diagnostics

The distribution of raw p-values and the QQ plot moderated t-statistics, displayed in Figure 11, can be used as a diagnosis of the tests performed.

par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(tt$P.Value, xlab = "Raw P-values", main = "", las = 1)
qqt(fit$t[, 2], df = fit$df.prior + fit$df.residual, main = "", pch = ".", cex = 3)
abline(0, 1, lwd = 2)
Raw p-values distribution and QQ plot of the moderated t-statistics for DE analysis.

Figure 11: Raw p-values distribution and QQ plot of the moderated t-statistics for DE analysis

Under the null hypothesis, which is that most genes are not differentially expressed, the raw p-value distribution should be uniform with a peak at low p-values for the truly DE genes. This coincides with the observed plot.

Regarding the QQ plot, it represents the quantiles of the data sample against the theoretical distribution they should follow. The diagonal represents the null hypothesis. Most genes are off-diagonal, indicating differential expression. The fact tha most of the genes are far from the null hypothesis is a measure of the significance of the results.

The differential expression results are diagnosed with volcano and MA plots, in which the top 10 differentially expressed genes are highlighted as shown in Figure 12. Volcano plots are scatter plots that summarize a measure of significance, such as the p-values (which can be in logarithmic scale as the plot below), and the log-fold change, which represents the measure of differential expression. Significant results will appear towards the plot, while fold changes increase to right for upregulated genes and to left for downregulated genes.

The top differentially expressed genes are tagged for both plots.

top <- order(fit$lods[, 2], decreasing = TRUE)[1:10]
fit$genes$symbol[top]
 [1] "FAM9A"   "UFM1"    "ZFR2"    "FGFR3"   "RBMX"    "SEMG1"   "PPP1R37"
 [8] "ERH"     "EXOC6"   "MIR4687"
par(mfrow = c(1, 2), mar = c(4, 5, 3, 2))

volcanoplot(fit, coef = 2, highlight = 10, names = fit$genes$symbol, main = "Volcano Plot")

limma::plotMA(fit, coef = 2, status = rownames(fit$lods) %in% signDEgenes, legend = FALSE,
main = "MA Plot", hl.pch = 46, hl.cex = 4, bg.pch = 46, bg.cex = 3, las = 1)
text(fit$Amean[top], fit$coef[top, 2], fit$genes$symbol[top], cex = 0.5, pos = 4)
Volcano and MA plots for DE analysis without SVA.

Figure 12: Volcano and MA plots for DE analysis without SVA

Both the volcano and MA plot show that the results do not show problems such as unexpected trends.

3.3 Session information

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] GO.db_3.7.0                 GOstats_2.48.0             
 [3] Category_2.48.1             Matrix_1.2-17              
 [5] GSVA_1.30.0                 GSVAdata_1.18.0            
 [7] hgu95a.db_3.2.3             org.Hs.eg.db_3.7.0         
 [9] GSEABase_1.44.0             graph_1.60.0               
[11] sva_3.30.1                  genefilter_1.64.0          
[13] mgcv_1.8-28                 nlme_3.1-140               
[15] geneplotter_1.60.0          annotate_1.60.1            
[17] XML_3.98-1.20               AnnotationDbi_1.44.0       
[19] lattice_0.20-38             edgeR_3.24.3               
[21] limma_3.38.3                SummarizedExperiment_1.12.0
[23] DelayedArray_0.8.0          BiocParallel_1.16.6        
[25] matrixStats_0.54.0          Biobase_2.42.0             
[27] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
[29] IRanges_2.16.0              S4Vectors_0.20.1           
[31] BiocGenerics_0.28.0         knitr_1.23                 
[33] BiocStyle_2.10.0           

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1             locfit_1.5-9.1         digest_0.6.19         
 [4] mime_0.7               R6_2.4.0               RSQLite_2.1.1         
 [7] evaluate_0.14          highr_0.8              zlibbioc_1.28.0       
[10] Rgraphviz_2.26.0       blob_1.1.1             rmarkdown_1.13        
[13] shinythemes_1.1.2      splines_3.5.3          stringr_1.4.0         
[16] RCurl_1.95-4.12        bit_1.1-14             shiny_1.3.2           
[19] compiler_3.5.3         httpuv_1.5.1           xfun_0.7              
[22] pkgconfig_2.0.2        htmltools_0.3.6        GenomeInfoDbData_1.2.0
[25] bookdown_0.11          codetools_0.2-16       AnnotationForge_1.24.0
[28] later_0.8.0            bitops_1.0-6           grid_3.5.3            
[31] RBGL_1.58.2            xtable_1.8-4           DBI_1.0.0             
[34] magrittr_1.5           stringi_1.4.3          XVector_0.22.0        
[37] promises_1.0.1         RColorBrewer_1.1-2     tools_3.5.3           
[40] bit64_0.9-7            survival_2.44-1.1      yaml_2.2.0            
[43] BiocManager_1.30.4     memoise_1.1.0         

4 Functional enrichment analysis

4.1 Gene set enrichment analysis (GSEA)

In addition to identifying differential expressed genes by adjusting surrogate variables in the differential expression analysis, a gene set enrichment analysis (GSEA) was also completed. This procedure provided more sensitivity in identifying gene expression changes. This analysis assessed how genes are behaving differently between two phenotypic states. The phenotypic states in this study were tumor and normal samples. The analysis calculated an enrichment score for each gene set. This score provided information on the changes in gene expression by inidividual genes in the gene set.

The first step in running a GSEA was to select a collection of gene sets from the MSigDB gene set collections provided by the Broad Institute. The C2 collection contained 3272 computational gene sets made up of cancer-oriented microarray data. This collection was well-curated and large, which made it an ideal candidate to use in the analysis.

library(GSEABase)
geneUniverse <- rownames(se.filt)
length(geneUniverse)
[1] 11872
library(GSVAdata) 
data(c2BroadSets) 
c2BroadSets
GeneSetCollection
  names: NAKAMURA_CANCER_MICROENVIRONMENT_UP, NAKAMURA_CANCER_MICROENVIRONMENT_DN, ..., ST_PHOSPHOINOSITIDE_3_KINASE_PATHWAY (3272 total)
  unique identifiers: 5167, 100288400, ..., 57191 (29340 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: BroadCollection (1 total)
length(c2BroadSets)
[1] 3272

After the collection was selected for the GSEA, the next step was to map the identifiers from the 3272 C2 gene sets to the dataset being analyzed.

gsc <- GeneSetCollection(c(c2BroadSets))

# Map identifiers
gsc <- mapIdentifiers(gsc, AnnoOrEntrezIdentifier(metadata(se.filt)$annotation))
gsc
GeneSetCollection
  names: NAKAMURA_CANCER_MICROENVIRONMENT_UP, NAKAMURA_CANCER_MICROENVIRONMENT_DN, ..., ST_PHOSPHOINOSITIDE_3_KINASE_PATHWAY (3272 total)
  unique identifiers: 5167, 100288400, ..., 57191 (29340 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: BroadCollection (1 total)

A matrix was created to check if the identifiers were properly mapped, or matched to the corresponding identifier in the data set being analyzed.

Im <- incidence(gsc)
dim(Im)
[1]  3272 29340
Im[1:2, 1:10]
                                    5167 100288400 338328 388 10631 440387
NAKAMURA_CANCER_MICROENVIRONMENT_UP    1         1      1   1     1      1
NAKAMURA_CANCER_MICROENVIRONMENT_DN    0         0      0   0     0      0
                                    54910 10136 3630 6662
NAKAMURA_CANCER_MICROENVIRONMENT_UP     1     1    1    1
NAKAMURA_CANCER_MICROENVIRONMENT_DN     0     0    0    0
Im <- Im[, colnames(Im) %in% rownames(se.filt)]
dim(Im)
[1] 3272 9963

Since the genes from the data set being analyzed are the only genes of interest, all genes that were not a part of this data set, denoted by a “0”, were discarded.

se.filtgsea <- se.filt[colnames(Im), ]
dge.filtgsea <- dge.filt[colnames(Im), ]

This left 11872 genes to be analyzed among 100 gene sets from the collection. With the data ready, a SVA was completed without calling any gene differentially expressed. The mean-variance was the same from the previous differential expression analysis so the plot was not displayed.

patientid <- substr(colnames(se.filtgsea), 9, 12)
mod <- model.matrix(~type + patientid, data = colData(se.filtgsea))
mod0 <- model.matrix(~patientid, data = colData(se.filtgsea))
sv <- sva(assays(se.filtgsea)$logCPM, mod, mod0)
Number of significant surrogate variables is:  12 
Iteration (out of 5 ):1  2  3  4  5  
# Add surrogates to the model
len <- length(colnames(mod))
mod <- cbind(mod, sv$sv)
colnames(mod) <- c(colnames(mod)[1:len], paste0("SV", 1:sv$n))

v <- voom(dge.filt, mod)
fit <- lmFit(v, mod)
fit <- eBayes(fit)
tt <- topTable(fit, coef = 2, n = Inf)

With the t-statistics avaialble, the z-scores were then calculated to see if a shift in gene expression within a gene set was present. First, a filter was set that required the gene sets to have a minimum size of 5 genes.

Im <- Im[rowSums(Im) >= 5, ]
dim(Im)
[1] 2760 9963

The t-statistics were then stored in a vector in order of the incidence matrix. The z-score was then calculated and sorted by the absolute score. The z-score of 5 indicated that this gene set was about 5 units above the mean. A low z-score suggested the genes in the gene set were not differentially expressed.

tGSgenes <- tt[match(colnames(Im), rownames(tt)), "t"]
length(tGSgenes)
[1] 9963
head(tGSgenes)
[1]   6.340600   4.823735   2.478672 -15.082266   3.014485   5.436968
zS <- sqrt(rowSums(Im)) * (as.vector(Im %*% tGSgenes)/rowSums(Im))
length(zS)
[1] 2760
head(zS)
 NAKAMURA_CANCER_MICROENVIRONMENT_UP  NAKAMURA_CANCER_MICROENVIRONMENT_DN 
                           -3.668597                            -3.829943 
WEST_ADRENOCORTICAL_TUMOR_MARKERS_UP WEST_ADRENOCORTICAL_TUMOR_MARKERS_DN 
                           -2.733001                            -7.887280 
                   WINTER_HYPOXIA_UP                    WINTER_HYPOXIA_DN 
                            3.749312                             9.144136 
rnkGS <- sort(abs(zS), decreasing = TRUE)
head(rnkGS)
       WHITEFORD_PEDIATRIC_CANCER_MARKERS 
                                 43.80714 
     GEORGES_TARGETS_OF_MIR192_AND_MIR215 
                                 41.32740 
         DODD_NASOPHARYNGEAL_CARCINOMA_DN 
                                 40.30265 
                 PUJANA_XPRSS_INT_NETWORK 
                                 39.89832 
                          KEGG_CELL_CYCLE 
                                 39.01311 
SPIELMAN_LYMPHOBLAST_EUROPEAN_VS_ASIAN_DN 
                                 38.80339 

A one sample z-test was calculated and a conservative multiple testing adjustment was administered to see if any gene sets were candidates for differentially expressed genes. The reason for completing a multiple testing adjustment was due to gene set overlaps.

pv <- pmin(pnorm(zS), 1 - pnorm(zS))
pvadj <- p.adjust(pv, method = "fdr")
DEgs <- names(pvadj)[which(pvadj < 0.01)]
length(DEgs)
[1] 2313

As shown, 2313 gene sets were differentially expressed.

4.2 Gene set variation analysis (GSVA)

In addition to the GSEA, a gene set variation analysis (GSVA) was performed. This differs from the standard GSEA by utilizing a gene set by sample matrix rather than a gene by sample matrix. This difference allowed for the pathway enrichment for each individual sample to be analyzed.

In order to do this, a matrix was created with the number of gene sets and an enrichment score that indicates sample-wise gene-level summaries of expression. Since gene sets with a small amount of genes don’t provide much information, a filter was set to remove those gene sets with 5 or less genes.

library(GSVA)
GSexpr <- gsva(assays(se.filt)$logCPM, gsc, min.sz=5, max.sz=300, verbose=FALSE)
dim(GSexpr)
[1] 2698  100

With the data ready, a SVA was completed without calling any gene differentially expressed.

mod <- model.matrix(~se.filt$type + patientid, data = colData(se.filt))
mod0 <- model.matrix(~ patientid, data = colData(se.filt))
svaobj <- sva(GSexpr, mod, mod0)
Number of significant surrogate variables is:  11 
Iteration (out of 5 ):1  2  3  4  5  
modSVs <- cbind(mod, svaobj$sv)

corfit <- duplicateCorrelation(GSexpr, modSVs, block = se.filt$cellline)
fit <- lmFit(GSexpr, modSVs)
fit <- eBayes(fit)
tt <- topTable(fit, coef = 2, n = Inf)
DEgs <- rownames(tt[tt$adj.P.Val < 0.01, , drop = FALSE])
length(DEgs)
[1] 1777

The 1777 gene sets that are differentially expressed at 1% FDR are shown in red in Figure 13.

Volcano and MA plots for GVSA

Figure 13: Volcano and MA plots for GVSA

4.3 Gene Ontology analysis

The next step was to complete a gene ontology analysis. This procedure was completed to see if any differentially expressed genes were associated with certain biological processes defined by the Gene Ontology system.

First, a parameter object was built that specified the gene universe of interest, the set of 635 DE genes, the ontology, and other information. The ontology selected was “BP”, which matched genes to the Biological Processes associated with the GO Terms.

geneUniverse <- rownames(se.filt)
length(geneUniverse)
[1] 11872
library(GOstats)
params <- new("GOHyperGParams", geneIds=DEgenes, universeGeneIds=geneUniverse,
            annotation="org.Hs.eg.db", ontology="BP",
            pvalueCutoff=0.01, testDirection="over")

In gene ontology analysis, GO terms can sometime overlap. Therefore, a conditional test was chosen. An Odds Ratio of “Inf” indicated that all genes within a gene set were differentially expressed.

conditional(params) <- TRUE
hgOverCond <- hyperGTest(params)
hgOverCond
Gene to GO BP Conditional test for over-representation 
7672 GO BP ids tested (44 have p < 0.01)
Selected gene set size: 1242 
    Gene universe size: 9840 
    Annotation package: org.Hs.eg 
goresults <- summary(hgOverCond)
head(goresults)
      GOBPID       Pvalue OddsRatio   ExpCount Count Size
1 GO:0007189 0.0005206625  2.638730  9.2140244    20   73
2 GO:0002011 0.0006846083  4.034760  3.7865854    11   30
3 GO:0010876 0.0007214380  1.821679 26.3798780    43  209
4 GO:0006821 0.0010321534  2.619485  8.3304878    18   66
5 GO:0007172 0.0011364481 27.777060  0.6310976     4    5
6 GO:0035627 0.0011364481 27.777060  0.6310976     4    5
                                                                       Term
1 adenylate cyclase-activating G protein-coupled receptor signaling pathway
2                                      morphogenesis of an epithelial sheet
3                                                        lipid localization
4                                                        chloride transport
5                                                   signal complex assembly
6                                                        ceramide transport

Since signficiantly enriched gene sets formed by a few genes or a lot of genes are not very reliable, a filter was added so those sets with less than or equal to 5 genes or more than 250 genes were removed. In addition, a filter was set on the Odds Ratio to find any above 1.5, which was a 50% increase over the null Odds Ratio of 1. The table was than ordered by Odds Ratio.

goresults <- goresults[goresults$Size >= 5 & goresults$Size <= 250  & goresults$OddsRatio >= 1.5, ]
goresults <- goresults[order(goresults$OddsRatio, decreasing=TRUE), ]
goresults
       GOBPID       Pvalue OddsRatio   ExpCount Count Size
5  GO:0007172 0.0011364481 27.777060  0.6310976     4    5
6  GO:0035627 0.0011364481 27.777060  0.6310976     4    5
7  GO:0046541 0.0011364481 27.777060  0.6310976     4    5
16 GO:0015886 0.0030676261 13.886914  0.7573171     4    6
27 GO:0001522 0.0064437510  9.256866  0.8835366     4    7
28 GO:0022010 0.0064437510  9.256866  0.8835366     4    7
29 GO:0038063 0.0064437510  9.256866  0.8835366     4    7
30 GO:0055064 0.0064437510  9.256866  0.8835366     4    7
20 GO:0045737 0.0046263593  6.946645  1.2621951     5   10
40 GO:0070875 0.0076078630  5.788197  1.3884146     5   11
33 GO:0048025 0.0072588377  4.632686  1.8932927     6   15
34 GO:1904754 0.0072588377  4.632686  1.8932927     6   15
14 GO:0006376 0.0028050015  4.281262  2.6506098     8   21
26 GO:0019369 0.0063707466  4.055466  2.3981707     7   19
2  GO:0002011 0.0006846083  4.034760  3.7865854    11   30
41 GO:0010171 0.0087513733  3.743071  2.5243902     7   20
23 GO:0003416 0.0053654763  3.709562  2.9030488     8   23
13 GO:1990573 0.0022246253  3.331508  4.2914634    11   34
25 GO:0031055 0.0059093774  3.026186  4.1652439    10   33
43 GO:0008333 0.0095090718  2.981230  3.7865854     9   30
21 GO:0034508 0.0046554698  2.946073  4.6701220    11   37
42 GO:0006336 0.0092703829  2.783442  4.4176829    10   35
1  GO:0007189 0.0005206625  2.638730  9.2140244    20   73
4  GO:0006821 0.0010321534  2.619485  8.3304878    18   66
18 GO:0098661 0.0042154616  2.374371  7.9518293    16   63
8  GO:0007187 0.0014057171  2.011321 16.4085366    29  130
22 GO:0042180 0.0047041428  1.950281 13.8841463    24  110
17 GO:0019935 0.0041579914  1.913764 15.2725610    26  121
3  GO:0010876 0.0007214380  1.821679 26.3798780    43  209
9  GO:0006631 0.0020002701  1.777542 23.7292683    38  188
15 GO:0030198 0.0029433800  1.730552 24.2341463    38  192
19 GO:0051222 0.0043438565  1.636833 28.0207317    42  222
32 GO:0042391 0.0071416215  1.554687 31.3024390    45  248
                                                                                          Term
5                                                                      signal complex assembly
6                                                                           ceramide transport
7                                                                             saliva secretion
16                                                                              heme transport
27                                                                     pseudouridine synthesis
28                                                          central nervous system myelination
29                               collagen-activated tyrosine kinase receptor signaling pathway
30                                                                    chloride ion homeostasis
20            positive regulation of cyclin-dependent protein serine/threonine kinase activity
40                                           positive regulation of glycogen metabolic process
33                                       negative regulation of mRNA splicing, via spliceosome
34                     positive regulation of vascular associated smooth muscle cell migration
14                                                                  mRNA splice site selection
26                                                          arachidonic acid metabolic process
2                                                         morphogenesis of an epithelial sheet
41                                                                          body morphogenesis
23                                                                    endochondral bone growth
13                                                 potassium ion import across plasma membrane
25                                                          chromatin remodeling at centromere
43                                                              endosome to lysosome transport
21                                                                 centromere complex assembly
42                                             DNA replication-independent nucleosome assembly
1                    adenylate cyclase-activating G protein-coupled receptor signaling pathway
4                                                                           chloride transport
18                                                     inorganic anion transmembrane transport
8  G protein-coupled receptor signaling pathway, coupled to cyclic nucleotide second messenger
22                                                           cellular ketone metabolic process
17                                                        cyclic-nucleotide-mediated signaling
3                                                                           lipid localization
9                                                                 fatty acid metabolic process
15                                                           extracellular matrix organization
19                                                    positive regulation of protein transport
32                                                            regulation of membrane potential

The table above identified GO terms of interest from the analysis. Some interesting terms related to lung cancer include signal complex assembly, ceramid transport, positive regulation of cyclin-dependent protein serine/threonine kinase actvity, and mRNA splice site selection.

4.4 Session information

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] GO.db_3.7.0                 GOstats_2.48.0             
 [3] Category_2.48.1             Matrix_1.2-17              
 [5] GSVA_1.30.0                 GSVAdata_1.18.0            
 [7] hgu95a.db_3.2.3             org.Hs.eg.db_3.7.0         
 [9] GSEABase_1.44.0             graph_1.60.0               
[11] sva_3.30.1                  genefilter_1.64.0          
[13] mgcv_1.8-28                 nlme_3.1-140               
[15] geneplotter_1.60.0          annotate_1.60.1            
[17] XML_3.98-1.20               AnnotationDbi_1.44.0       
[19] lattice_0.20-38             edgeR_3.24.3               
[21] limma_3.38.3                SummarizedExperiment_1.12.0
[23] DelayedArray_0.8.0          BiocParallel_1.16.6        
[25] matrixStats_0.54.0          Biobase_2.42.0             
[27] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
[29] IRanges_2.16.0              S4Vectors_0.20.1           
[31] BiocGenerics_0.28.0         knitr_1.23                 
[33] BiocStyle_2.10.0           

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.5.3          shiny_1.3.2           
 [4] statmod_1.4.32         BiocManager_1.30.4     highr_0.8             
 [7] RBGL_1.58.2            blob_1.1.1             GenomeInfoDbData_1.2.0
[10] yaml_2.2.0             RSQLite_2.1.1          digest_0.6.19         
[13] RColorBrewer_1.1-2     promises_1.0.1         XVector_0.22.0        
[16] htmltools_0.3.6        httpuv_1.5.1           pkgconfig_2.0.2       
[19] bookdown_0.11          zlibbioc_1.28.0        xtable_1.8-4          
[22] later_0.8.0            survival_2.44-1.1      magrittr_1.5          
[25] mime_0.7               memoise_1.1.0          evaluate_0.14         
[28] tools_3.5.3            stringr_1.4.0          locfit_1.5-9.1        
[31] compiler_3.5.3         grid_3.5.3             RCurl_1.95-4.12       
[34] AnnotationForge_1.24.0 bitops_1.0-6           rmarkdown_1.13        
[37] codetools_0.2-16       DBI_1.0.0              R6_2.4.0              
[40] bit_1.1-14             shinythemes_1.1.2      Rgraphviz_2.26.0      
[43] stringi_1.4.3          Rcpp_1.0.1             xfun_0.7              

5 References